Inspiration
In the session we watched Hans Rosling’s “200 countries and 200 years in 4 minutes”, which we (hopefully) agreed is something to aspire to. Combined with his enthusiastic presentation, the visualisations in this clip support a clear narrative and help us understand a complex dataset.
The plot he builds is interesting because it uses many different visual attributes (aesthetics) to express features in the data, including:
- X and Y axes
- Colour
- Size of the points
- Time (in the animation)
These features are carefully selected to highlight important features of the data and support the narrative he provides. Although we need to have integrity in our plotting (we discuss bad examples in the session), this narrative aspect of a plot is important: we always need to consider our audience.
Before you start
Use the ‘files pane’ in RStudio to make a new folder on the RStudio
server to save your work. Call this datafluency2022.
Inside this new folder, make a new RMarkdown file (use the ‘file’
menu and choose ‘new’). When you save the file make sure it has the
extension .rmd, so call it datavis.rmd for
example.
Use this new .rmd file to save your work during this
session.
Recreate the Rosling plot
“Multi-dimensional plotting” sounds fancy, but it just means linking different visual features of a plot to columns in the dataset.
In the example, Rosling’s plot is appealing and informative because it adds multiple dimensions, and uses a special logarithmic scale for the x-axis.
The Rosling plot shown above has four dimensions:
- x axis (GDP)
- y axis (life exp.)
- color (continent)
- size (population)
The additional dimension in the plot shown in the BBC video is time, because it shows these other values changing across years.
Defining dimensions/aesthetics
As a reminder: ggplot uses the term
aesthetics to refer to different dimensions of a
plot.
‘Aesthetics’ refers to ‘what things look like’, and the
aes() command in ggplot maps variables
(columns in the dataset) to visual features of the plot.
There are 4 visual features (aesthetics) of plots we will use in this session:
- x and y axes
- colour
- size (of a point, or thickness of a line)
We could also use:
- shape (of points)
- linetype (of added lines, i.e. dotted/patterned or solid)
Additionally, we will control the scale of the axes in the
plot to improve the presentation of the data.
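As a minimal sketch of how aes() links columns to visual features, the example below uses R's built-in mtcars dataset. This dataset and the choice of columns are assumptions made purely for illustration; they are not part of this session's data.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for any dataset
# with a mix of numeric and categorical columns (an assumption for illustration)
aes_demo <- ggplot(mtcars,
                   aes(x = wt,                # x axis: car weight
                       y = mpg,               # y axis: miles per gallon
                       colour = factor(cyl),  # colour: number of cylinders, as categories
                       size = hp)) +          # size: engine horsepower
  geom_point()

aes_demo
```

Each argument inside aes() names a column, and ggplot works out the legend and scales for that visual feature automatically.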
Rosling’s plot looked something like this:
To create a (slightly simplified) version of the plot above, the code would look something like this:
development %>%
  filter(BLANK==BLANK) %>%
  ggplot(aes(x=BLANK,
             y=BLANK,
             size=BLANK,
             color=BLANK)) +
  geom_point() +
  scale_x_log10() +
  labs(x=BLANK, y=BLANK, color=BLANK, size=BLANK)
I have removed some parts of the code. Your job is to edit the parts
which say BLANK and replace them with the names of
variables from the development dataset (available in the
psydata package).
Some hints:
- All the BLANKs represent variable names in the dataset. You can see a list of the column names available by typing glimpse(development).
- If you are confused by the filter(BLANK==BLANK), check the title of the plot above. Remember that filter selects particular rows from the data, so we can use it to restrict what is shown in the plot. What data do we need to select for this plot?
- The part which reads scale_x_log10() is explained in more detail below.
- The part which reads labs(...) is explained below.
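To illustrate what filter() and labs() do without giving away the exercise, here is a small sketch using R's built-in mtcars dataset (an assumption for illustration; the column names and label text are not from the development data).

```r
library(dplyr)
library(ggplot2)

# Keep only the rows where cyl equals 4; all other rows are dropped before plotting
four_cyl <- mtcars %>%
  filter(cyl == 4)

filter_demo <- four_cyl %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point() +
  # labs() replaces the default axis labels (the column names) with readable text
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

filter_demo
```

The same pattern applies to any dataset: the filter() step decides which rows reach the plot, and labs() only changes the text shown on it.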
Using multiple layers
When visualizing data, there’s always more than one way to do things.
As well as plotting different dimensions, different types of
plot can highlight different features of the data. In ggplot, different
types of plots are called geometries. Multiple layers
can be combined in the same plot by adding together commands which have
the prefix “geom_”.
As we have already seen, geom_point() is used to create
a scatter plot:
To add additional layers to this plot, we can add extra
geom_<NAME> functions. For example,
geom_smooth is used to overlay a smooth line to any x/y
plot:
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point() +
  geom_smooth()
Explanation of the command: We added
+ geom_smooth() to our previous plot. This means we now
have two geometries added to the same plot: geom_point and
geom_smooth.
Explanation of the output: The plot shown is the
same as the previous scatterplot, but now has a smooth blue line
overlaid. This represents the local average of mpg for
each level of weight. There is also a grey-shaded area,
which represents the uncertainty around the local average (by default
a 95% confidence interval; there will be more on this later in the course).
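The shaded band can be adjusted through the se and level arguments of geom_smooth(). The sketch below uses the built-in mtcars dataset as a stand-in (an assumption for illustration, since it needs no extra packages beyond ggplot2).

```r
library(ggplot2)

# A basic scatter plot to build on (mtcars is a stand-in dataset)
base_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# Default: smooth line plus a shaded band
with_band <- base_plot + geom_smooth()

# se = FALSE hides the shaded band and keeps only the line
no_band <- base_plot + geom_smooth(se = FALSE)

# level sets the width of the confidence band (0.95 is the default)
wide_band <- base_plot + geom_smooth(level = 0.99)
```

Printing each object in turn (for example by typing no_band) shows how the band changes while the points and line stay the same.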
We could add other layers to the plot with other geom_
functions.
For example, we could calculate the median mpg
of all cars in the dataset and overlay that as a horizontal
line:
med_mpg <- fuel %>% summarise(median(mpg)) %>% pull(1)
fuel %>%
  ggplot(aes(weight, mpg)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = med_mpg, color="red")
Explanation of the code: First we calculate
the median of mpg using summarise. Then we use the
pull(1) command for the first time. This selects the first
column of data (containing our median) and returns only that. That is,
it returns only a number, or a vector of numbers, rather than the whole
dataframe. Then we added the geom_hline
function to our existing plot. We set the
yintercept option to the stored med_mpg value,
and this defines the height of the line. We added
color="red" to make this added line distinctive.
Explanation of the output: The plot is the
same as before, but now has a red line marked at the median of the
mpg column. The red line is on top of the other plot
elements because we added this line at the end of our plotting code.
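To see what pull(1) changes, it can help to look at the result before and after it is applied. This is a small sketch using the built-in mtcars dataset (an assumption for illustration, used here in place of the fuel data).

```r
library(dplyr)

# summarise() returns a data frame with one row and one column
med_table <- mtcars %>% summarise(median(mpg))

# pull(1) extracts the first column as a plain vector, here a single number
med_value <- med_table %>% pull(1)

is.data.frame(med_table)  # TRUE
is.numeric(med_value)     # TRUE
```

A plain number is convenient because it can be handed directly to an option such as yintercept.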
Make your own layered plot
- Use the mentalh dataset from psydata. Create a scatter plot of screen time and anxiety scores, adapting the code above.
- Add a smoothed line to the plot using geom_smooth().
- Colour the plot, using the education variable.
- Add a horizontal line to the plot, with the yintercept set to the average anxiety score.
med_anx <- mentalh %>% summarise(median(anxiety_score)) %>% pull(1)
mentalh %>%
  ggplot(aes(screen_time, anxiety_score, color=education)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  geom_hline(yintercept = med_anx)
Scales
Incomes are “not normal”
If we re-plot the development data using the default settings in ggplot you might notice that the result looks quite different to the one shown in the video or the exercise above.
In particular, the placement of the points looks quite different:
Specifically, we can see that in the left hand panel the points are mostly compressed to the left hand of the frame. In contrast, in the original plot, the points are fairly evenly spread across the x-axis.
These plots are showing the same data. The only difference is that the original plot uses a log scale.
We can recognise the log scale by looking at the markings on the x axis:
- In the left hand panel the markings go up by 30,000 each time.
- In the original plot, each marker represents a value 10 times
larger than the previous. So, 1000, 10,000 and 100,000 (the values
shown are in scientific notation, so 1e+03 means 1000).
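The effect of the log scale on spacing can be checked with a few lines of base R. The three values below are simply the axis markers mentioned above, not values taken from the dataset.

```r
# Values that are each 10 times larger than the previous one
incomes <- c(1000, 10000, 100000)

# On the raw scale the gaps between them grow
diff(incomes)         # 9000 90000

# On the log10 scale the gaps are equal, so the values are evenly spaced
log10(incomes)        # 3 4 5
diff(log10(incomes))  # 1 1
```

This is why points that are bunched together at low incomes spread out once the x-axis is logarithmic.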
Skewed distributions
Another way to see why this helps is to plot the distribution of incomes:
development %>%
  ggplot(aes(gdp_per_capita)) +
  geom_histogram()
Explanation of the code: We used the
geom_histogram function to make a histogram of the GDP per
capita variable.
Explanation of the output: The histogram shows that most GDP values are below $20,000, but a small number are much, much larger (i.e. > $100,000). This is quite typical of income data.
We can then add scale_x_log10() to the same plot:
development %>%
  ggplot(aes(gdp_per_capita)) +
  geom_histogram() +
  scale_x_log10()
Explanation of the code: We make another
histogram, this time adding scale_x_log10().
Explanation of the output: The plot changes, and the distribution is less skewed. We can see that the scale markers are again in scientific notation, and the gaps between points of the scale are not equal: each point on the scale is 10 times larger than the previous one (1000, 10,000, 100,000), stretching out the values across the x axis and reducing the skew.
Reaction times
In the example above we saw that incomes were not normally distributed and benefited from a log scale. Another common example of ‘non-normal’ data are those from reaction time studies.
For example:
rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')
rtdata %>%
  ggplot(aes(rt)) +
  geom_histogram()
Explanation of the code: The line with
read_csv takes a web address (URL) and reads a ‘comma
separated values’ data file from it. The next part makes a histogram
using the reaction time (RT) data it contains. In case it is truncated
in the output above, the full url is: https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv
Explanation of the output: The RT data are strongly skewed. Consequently the median, 2.01, is lower than the mean value, 2.51.
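The link between skew and the gap between mean and median can be seen with simulated data. The sketch below generates hypothetical right-skewed values with rlnorm(); the parameters are arbitrary assumptions and the numbers are not the real reaction time data.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated right-skewed "reaction times" (hypothetical data, not the real dataset)
sim_rt <- rlnorm(1000, meanlog = 0.7, sdlog = 0.5)

# The long right tail pulls the mean above the median
median(sim_rt)
mean(sim_rt)

# After a log transform the two are much closer together
median(log(sim_rt))
mean(log(sim_rt))
```

A few very slow responses are enough to drag the mean upwards, which is why the median is often reported for data like these.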
- Copy and paste the line of code which reads the CSV data and run it.
- Recreate the histogram above (use geom_histogram) or a density plot (use geom_density).
- Add the correct scale function to recreate this plot:
rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')
rtdata %>%
  ggplot(aes(rt)) +
  geom_histogram() +
  scale_x_log10()
Or
rtdata <- read_csv('https://raw.githubusercontent.com/lindeloev/shiny-rt/master/mrt_data.csv')
rtdata %>%
  ggplot(aes(rt)) +
  geom_density() +
  scale_x_log10()
Animation!
This is an optional exercise and is not required for the course assessment. Skip to the final section if you are short on time.
The gganimate package allows us to create animations
using ggplot. The package has good documentation here: https://gganimate.com.
As an example we can load the package:
library(gganimate)
And then adapt our previous ggplot by adding
transition_time(year). This adds year as a
time-based dimension, animating the plot.
We need to save the resulting plot in a variable, and then send that
to the animate function. Here we use a variable called
progress_plot.
progress_plot <- development %>%
  ggplot(aes(gdp_per_capita, life_expectancy, color=continent)) +
  geom_point() +
  scale_x_log10() +
  transition_time(year)
progress_plot %>% animate()
Try to animate the following plot using data from all the years in
the development dataset. To do this, amend the code below,
referring to the example above:
development %>%
  filter(year == 1952) %>%
  ggplot(aes(continent, life_expectancy)) +
  geom_boxplot() +
  labs(title="Year: 1952")
You need to:
- Remove the filter
- Use the transition_time() function
library(tidyverse)
library(psydata)
library(gganimate)
p <- development %>%
  ggplot(aes(continent, life_expectancy)) +
  geom_boxplot() +
  labs(title = "Year: {frame_time}") +
  transition_time(year)
p %>% animate()
Graphics are for answering questions
Note: you do not need to have completed the extension exercise using animation to engage with this activity.
Consider the following plots, which include the animation we made above:
Animated boxplot
Boxplot
Line graph
This plot shows the median life expectancy in 1952 and 2002.
Ribbon plot
This plot shows the median (line) and IQR (shaded area) in all available years:
development %>%
  ggplot(aes(year, life_expectancy, fill=continent, color=continent)) +
  stat_summary(geom="ribbon", fun.data = median_iqr, alpha=.2, linetype=0) +
  stat_summary(geom="line", fun.data=median_iqr) +
  theme_minimal() +
  labs(title = "Life expectancy between 1952 and 2022 (median and IQR)", x="Year",
       y="Life expectancy (years)",
       fill="Continent", color="Continent")
- Create a table of the strengths and weaknesses of each plot, like so:
| Plot | Strengths | Weaknesses |
|---|---|---|
| Animated boxplot | … | … |
| Boxplot | … | … |
| Line graph | … | … |
| … | … | … |
- Think of some questions that can be answered from these data. For example:
- Which continent changed most between 1952 and 2002?
- Which continent has the most/least variability in life expectancy?
- Which continents became more/less homogeneous over this period?
And so on… Make a list of at least 5 or 6 questions which someone might want to know the answer to.
Which plots are most effective in answering each of your questions?
Imagine you were a journalist writing a story with the title “Asia sees fastest rise in life expectancy since WW2”. Can you think of any reasons to prefer the line graph over the boxplot?